Abstract

Background: Genetic mutation, selective pressure for translational efficiency and accuracy, level of gene 
expression, and protein function through natural selection are all believed to lead to codon usage bias (CUB). 
Therefore, informative measurement of CUB is of fundamental importance to making inferences regarding gene 
function and genome evolution. However, extant measures of CUB have not fully accounted for the quantitative 
effect of background nucleotide composition and have not statistically evaluated the significance of CUB in 
sequence analysis. 

Results: Here we propose a novel measure, the Codon Deviation Coefficient (CDC), that provides an informative
measurement of CUB and its statistical significance without requiring any prior knowledge. Unlike previous
measures, CDC estimates CUB by accounting for background nucleotide compositions tailored to codon positions
and adopts bootstrapping to assess the statistical significance of CUB for any given sequence. We evaluate
CDC by examining its effectiveness on simulated sequences and empirical data and show that CDC outperforms
extant measures by achieving a more informative estimation of CUB and its statistical significance.

Conclusions: As validated by both simulated and empirical data, CDC provides a highly informative quantification 
of CUB and its statistical significance, useful for determining comparative magnitudes and patterns of biased codon 
usage for genes or genomes with diverse sequence compositions. 

Keywords: Codon deviation coefficient, CDC, Codon usage bias, CUB, Statistical significance, Background
nucleotide composition, GC content, Purine content, Bootstrapping



Background 

Codon usage bias (CUB), a phenomenon in which synonymous codons (those that encode the same amino acid) are used at different frequencies, is generally believed to be a combined outcome of mutation pressure, natural selection, and genetic drift [1-5]. Within any given species, genes often exhibit variable degrees of CUB. Moreover, the CUB of an individual gene is closely related to gene expression through translational efficiency and/or accuracy [6-10]. Therefore, the ability to accurately quantify CUB for protein-coding sequences is of fundamental importance in revealing the underlying mechanisms behind codon usage and in understanding gene evolution and function in general.

Over the past few years, a number of measures have been proposed for the quantification of CUB [11-23], leading to investigations of the patterns of CUB within and across species [24-30]. Since CUB is primarily shaped by selection and mutation [5], different measures are differentially informative with regard to differentiating causes. For instance, there are purely descriptive measures of CUB as caused by the joint effects of mutation and selection, such as the Effective Number of Codons (Nc or ENC) [13] and the Relative Synonymous Codon Usage [22]. Alternatively, other measures of CUB specifically accord with selection on codon usage associated with translation, such as the Codon Adaptation Index (CAI) [12] and the Frequency of Optimal codons [15]. In addition, a number of studies have attempted to estimate selection on codon usage based on population genetics [31-35].

These existing measures generally fall into two categories: they compare the observed codon usage distribution of a target coding sequence against either the distribution based on a reference set of highly-expressed genes (e.g., CAI) or the distribution based on a null hypothesis of uniform usage of synonymous codons (e.g., Nc). The former measures are highly dependent on their corresponding reference sets (from which preferred codons are derived) and accordingly are limited by the comprehensiveness and accuracy of those reference sets. Since reference sets are species-specific, these measures are inappropriate for comparison of CUB across species [36]. Additionally, they are unreliable in cases where there is inadequate knowledge about the highly-expressed genes of a given species [37], such as newly sequenced species that have a limited number of annotated genes.

Due to these shortcomings, measures that do not require prior knowledge of reference gene sets have been implemented. These measures assume a null distribution of uniform usage of synonymous codons and estimate the departure of the observed codon usage from the expected. Among them, Nc is one of the most widely used measures [13]. Its variant, Nc' [19], incorporates GC content of the coding sequence as background nucleotide composition (BNC) into CUB estimation. Accounting for BNC refines codon usage analysis, providing a comparable metric for analyses within and among species exhibiting various non-uniform BNCs. In the context of protein-coding sequences, for instance, bacteria have diverse BNCs, as their GC contents vary widely, from ~20% to ~80%. Even within a single species, genes often differ considerably in background GC content, as in the case of Escherichia coli str. K-12 substr. MG1655, whose genes have GC contents ranging from 26.9% (rfaS; length = 311 aa) to 66.8% (yagF; length = 655 aa). Therefore, it is crucial to measure the departure of codon usage from the corresponding background composition (instead of the presumed uniform codon usage). Due to its appropriate consideration of BNC, Nc' outperforms other relevant measures [19].

However, all extant measures (including Nc) still have limitations. First, they give a general estimate of CUB, but have not been supplied with straightforward procedures for assessing the statistical significance of the bias in codon usage for any given gene. Genes that vary in length and differ in CUB may exhibit different levels of statistical significance for their codon biases. Assessing statistical significance can considerably strengthen the functional relationships ascertained, by discounting sampling error in correlated gene sets. Second, no previous measure is fully effective at incorporating BNC into CUB estimation. Although Nc' factors in GC content as BNC, it does not account for known variation in BNCs at the three codon positions [38]. In bacteria, for instance, Bartonella quintana str. Toulouse and Clostridium thermocellum ATCC 27405 have very similar GC contents in coding sequences (40.5% and 40.4%, respectively), but their position-specific GC contents are quite different: 53.3% and 47.3% at the first codon position, 38.6% and 34.0% at the second codon position, and 29.5% and 39.9% at the third codon position, respectively. Likewise, genes within a given species can also have heterogeneous BNCs at the three codon positions; in E. coli, for example, two genes, emrE and hlyE, are similar in their overall GC contents (41.5% and 41.1%) but different in positional GC contents: 42.7% and 48.2% at the first position, 46.4% and 32.0% at the second position, and 35.5% and 43.2% at the third position, respectively. Such differences in positional BNCs reflect the outcomes of diverse evolutionary mechanisms (e.g., dinucleotide bias [39], horizontal gene transfer [40], strand compositional asymmetry in bacteria [41], and isochore structure in vertebrates [42]), thus conflating the roles of mutation and selection acting at different codon positions. Therefore, incorporation of differential positional BNCs into CUB estimation promises to increase its effectiveness and reliability.

Moreover, GC content is not the sole parameter of BNC. As illustrated in Zhang and Yu [43], joint use of GC and purine contents effectively models nucleotide, codon, and amino acid compositions. In contrast to the broad variation of GC content, purine content varies within a much narrower range around 50%, presumably because purines play a determinative role in the physicochemical properties of amino acids [44,45]. Similar to GC content, purine content differs not only from one species to another, but also from one gene to another, and even between genes with similar GC contents. For instance, emrE and hlyE in E. coli, which are similar in their overall GC contents, have entirely different purine contents not only at the overall level (45.8% and 55.6%, respectively), but also at the three codon positions (54.5% and 68.3% at the first position, 34.5% and 48.2% at the second position, and 48.2% and 50.2% at the third position, respectively). Thus, in addition to GC content, purine content is also a significant feature of BNC.
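
The overall and position-specific GC and purine contents discussed above can be computed directly from a coding sequence. The following is a minimal Python sketch rather than the authors' implementation: the function name `positional_contents`, the key names (GC1, AG1, and so on) and the toy sequence are illustrative assumptions, and the input is assumed to be an in-frame coding sequence over A, T, G and C.

```python
def positional_contents(cds):
    """Return overall and per-codon-position GC and purine (AG) contents.

    Assumes `cds` is an in-frame coding sequence; trailing bases that do not
    form a complete codon are ignored.
    """
    seq = cds.upper()
    usable = len(seq) - len(seq) % 3
    codons = [seq[i:i + 3] for i in range(0, usable, 3)]
    if not codons:
        raise ValueError("sequence must contain at least one complete codon")

    n = len(codons)
    contents = {}
    for pos in range(3):
        bases = [codon[pos] for codon in codons]
        contents[f"GC{pos + 1}"] = sum(b in "GC" for b in bases) / n
        contents[f"AG{pos + 1}"] = sum(b in "AG" for b in bases) / n

    all_bases = "".join(codons)
    contents["GC"] = sum(b in "GC" for b in all_bases) / len(all_bases)
    contents["AG"] = sum(b in "AG" for b in all_bases) / len(all_bases)
    return contents


if __name__ == "__main__":
    # Toy sequence (hypothetical, four codons): ATG GCC AAA GGT
    for key, value in sorted(positional_contents("ATGGCCAAAGGT").items()):
        print(key, round(value, 3))
```

Two sequences with nearly identical overall GC content can return quite different GC1, GC2 and GC3 values from such a function, which is the situation described for emrE and hlyE.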

Here we present a novel measure, the Codon Deviation Coefficient (CDC), and use it to characterize CUB and to ascertain its statistical significance. CDC takes account of both GC and purine contents, comprehensively addressing heterogeneous BNCs, not only in sequences but also at the three codon positions. It adopts the cosine distance metric to quantify CUB and employs bootstrapping to assess its statistical significance, requiring no prior knowledge of reference gene sets. We describe CDC in detail and provide comparative results in the form of an in-depth evaluation of simulated sequences and empirical data.

Methods 

Expected codon usage 

CDC considers both GC and purine contents as BNC and derives expected codon usage from observed positional GC and purine contents. We denote the content of the four nucleotides (adenine, thymine, guanine, and cytosine), GC content, and purine content as A, T, G, C, S and R, respectively. As in Zhang and Yu [43], position-dependent nucleotide contents can be formulated in the following way:

$$A_i = (1 - S_i) R_i, \quad T_i = (1 - S_i)(1 - R_i), \quad G_i = S_i R_i, \quad C_i = S_i (1 - R_i), \tag{1}$$

where $S_i$ and $R_i$ are the corresponding observed contents at codon position i and $A_i$, $T_i$, $G_i$, $C_i$ are the expected nucleotide contents at codon position i (i = 1, 2, 3). For any sense codon xyz, where x, y, z ∈ {A, T, G, C}, the expected usage $\pi_{xyz}$ is defined as the product of its constituent expected nucleotide contents $x_1 y_2 z_3$, normalized by the sum over all sense codons, viz.

$$\pi_{xyz} = \frac{x_1 y_2 z_3}{\sum_{abc} \omega_{abc}\, a_1 b_2 c_3} \tag{2}$$
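
As a minimal illustration of Eqs. 1 and 2, the sketch below derives the expected usage of the 61 sense codons from positional GC (S) and purine (R) contents. It is not the CAT source code: the function names are hypothetical, and the canonical genetic code (stop codons TAA, TAG and TGA) is assumed.

```python
from itertools import product

# Assumption: canonical genetic code with three stop codons.
STOP_CODONS = {"TAA", "TAG", "TGA"}


def expected_nucleotides(S, R):
    """Eq. 1 for one codon position: S is GC content, R is purine content."""
    return {
        "A": (1 - S) * R,
        "T": (1 - S) * (1 - R),
        "G": S * R,
        "C": S * (1 - R),
    }


def expected_codon_usage(S_pos, R_pos):
    """Eq. 2: product of positional contents, normalized over sense codons."""
    nt = [expected_nucleotides(S, R) for S, R in zip(S_pos, R_pos)]
    raw = {}
    for x, y, z in product("ATGC", repeat=3):
        codon = x + y + z
        if codon not in STOP_CODONS:  # weight is 1 for sense codons, 0 otherwise
            raw[codon] = nt[0][x] * nt[1][y] * nt[2][z]
    total = sum(raw.values())
    return {codon: value / total for codon, value in raw.items()}


if __name__ == "__main__":
    usage = expected_codon_usage([0.5, 0.3, 0.7], [0.5, 0.5, 0.5])
    print(len(usage), round(sum(usage.values()), 6))
```

With uniform compositions (S = R = 0.5 at every position) every sense codon receives the same expected usage of 1/61, which corresponds to the uniform null assumed by measures that ignore BNC.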



Statistical significance of codon usage bias 

We implement a bootstrap resampling of N = 10000 replicates for any given sequence to evaluate the statistical significance of non-uniform codon usage. Each replicate is randomly generated according to the sequence BNC ($S_i$ and $R_i$, i = 1, 2, 3) and the sequence length. Consequently, we obtain a bootstrap distribution of N estimates of CUB. A two-sided bootstrap P-value is calculated as twice the smaller of the two one-sided P-values [47]. P ranges from 0 to 1. By convention, a statistically significant CUB is identified by P < 0.05. CDC represents the first application of bootstrap resampling to estimating the statistical significance of CUB. Bootstrapping may also be applicable to other related measures.
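
The procedure above can be sketched as follows. This is an illustrative reading of the description, not the authors' code: function names are hypothetical, the canonical genetic code is assumed, each replicate is drawn codon by codon from the expected usage implied by the sequence BNC, the CDC of each replicate is computed against its own BNC (an assumption, since the text does not specify this detail), and the P-value is capped at 1.

```python
import math
import random
from collections import Counter
from itertools import product

# Assumption: canonical genetic code (61 sense codons).
STOP_CODONS = {"TAA", "TAG", "TGA"}
SENSE_CODONS = ["".join(c) for c in product("ATGC", repeat=3)
                if "".join(c) not in STOP_CODONS]


def positional_bnc(codons):
    """Observed GC (S_i) and purine (R_i) contents at the three positions."""
    n = len(codons)
    S = [sum(c[i] in "GC" for c in codons) / n for i in range(3)]
    R = [sum(c[i] in "AG" for c in codons) / n for i in range(3)]
    return S, R


def expected_usage(S, R):
    """Expected sense-codon usage from positional BNC (Eqs. 1 and 2)."""
    nt = [{"A": (1 - s) * r, "T": (1 - s) * (1 - r), "G": s * r, "C": s * (1 - r)}
          for s, r in zip(S, R)]
    raw = [nt[0][c[0]] * nt[1][c[1]] * nt[2][c[2]] for c in SENSE_CODONS]
    total = sum(raw)
    return [v / total for v in raw]


def cdc(codons):
    """Cosine distance between expected and observed codon usage."""
    expected = expected_usage(*positional_bnc(codons))
    counts = Counter(codons)
    observed = [counts[c] / len(codons) for c in SENSE_CODONS]
    dot = sum(e * o for e, o in zip(expected, observed))
    norm = math.sqrt(sum(e * e for e in expected) * sum(o * o for o in observed))
    return 1.0 - dot / norm


def bootstrap_p_value(codons, n_replicates=10000, seed=0):
    """Two-sided bootstrap P-value: twice the smaller one-sided P-value."""
    rng = random.Random(seed)
    observed = cdc(codons)
    weights = expected_usage(*positional_bnc(codons))
    n_ge = n_le = 0
    for _ in range(n_replicates):
        # Replicate with the same length, drawn from the sequence BNC.
        replicate = rng.choices(SENSE_CODONS, weights=weights, k=len(codons))
        value = cdc(replicate)
        n_ge += value >= observed
        n_le += value <= observed
    p_value = min(1.0, 2 * min(n_ge, n_le) / n_replicates)
    return observed, p_value


if __name__ == "__main__":
    # Hypothetical, strongly biased sequence that uses only four codons.
    toy = ["GCT", "AAA", "CTG", "GAA"] * 50
    print(bootstrap_p_value(toy, n_replicates=500))
```

Because the replicates share the length of the input, short sequences yield a wide bootstrap distribution and can fail to reach significance even when their CDC value is high.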

Implementation and availability 

CDC is written in standard C++ and implemented in the Composition Analysis Toolkit (CAT), which is distributed as open-source software and licensed under the GNU General Public License. Its software package, including compiled executables for Linux/Mac/Windows, example data, documentation, and source code, is freely available at http://cbb.big.ac.cn/software and http://cbrc.kaust.edu.sa/CAT.



In Eq. 2, $\omega_{abc} = 1$ if abc is a sense codon and $\omega_{abc} = 0$ otherwise, with a, b, c ∈ {A, T, G, C}.

Codon usage bias

Any coding sequence can be represented as a vector of n dimensions, whose entries correspond to the usages of the n sense codons in the sequence. The dimension n equals 61 for the canonical code; although codons ATG and TGG could be set aside due to the absence of synonymous codons, calculation based on a vector of 61 dimensions instead of 59 dimensions makes little substantial difference. To calculate CUB for any given sequence, we employ the cosine distance metric [46], based on the cosine of the angle between two vectors of n dimensions. Therefore, when both expected ($\pi$) and observed ($\hat{\pi}$) codon usage vectors are available for any given sequence, CDC renders a distance coefficient ranging from 0 (no bias) to 1 (maximum bias) to represent CUB, expressed by the deviation of $\hat{\pi}$ from $\pi$ (Eq. 3).

$$\mathrm{CDC} = 1 - \frac{\sum_{xyz} \pi_{xyz} \times \hat{\pi}_{xyz}}{\sqrt{\sum_{xyz} \pi_{xyz}^{2} \times \sum_{xyz} \hat{\pi}_{xyz}^{2}}} \tag{3}$$

Results and discussion

Comparative analysis on simulated data


To evaluate the performance of CDC and compare it against the most powerful extant measure, Nc', as well as Nc, we took an approach based on that of Novembre [19] to simulate coding sequences specifying different positional BNCs and varying sequence lengths. Five sets of position-associated compositions were used to generate simulated sequences (Table 1). It should be noted that CDC ranges from 0 (no bias) to 1 (maximum bias), whereas Nc' and Nc range from 20 (maximum bias) to 61 (no bias). To facilitate comparisons of CDC with Nc' and Nc, we use the formulas (61 - Nc')/41 and (61 - Nc)/41 to rescale their ranges, denoted as scaled Nc' and scaled Nc, respectively, from 0 (no bias) to 1 (maximum bias).
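
The cosine-distance form of CDC and the rescaling of Nc and Nc' described above amount to a few lines of arithmetic. The sketch below uses hypothetical function names and takes the expected and observed codon usage vectors (in the same codon order) as given.

```python
import math


def cosine_distance(expected, observed):
    """One minus the cosine of the angle between two usage vectors."""
    dot = sum(e * o for e, o in zip(expected, observed))
    norm = math.sqrt(sum(e * e for e in expected) * sum(o * o for o in observed))
    if norm == 0:
        raise ValueError("usage vectors must not be all zeros")
    return 1.0 - dot / norm


def scale_nc(nc):
    """Map Nc or Nc' from [20, 61] onto [0, 1], with 0 meaning no bias."""
    return (61.0 - nc) / 41.0


if __name__ == "__main__":
    print(cosine_distance([0.5, 0.5], [0.5, 0.5]))  # identical usage vectors
    print(scale_nc(61), scale_nc(20))                # the two ends of the scale
```

Identical vectors give a distance of 0 and non-overlapping vectors give 1, so the coefficient and the rescaled Nc and Nc' values share the same orientation and range.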

A good measure should not deviate much from its 
expectation as the amount of data approaches infinity or 
any sufficiently large number. Thus, we first simulated 
sequences with a total of 100,000 codons using five 
positional composition sets (PCSs) (Table 1). 

Table 1 Background nucleotide compositions at three codon positions specified in simulations

Content        None   Low    Med-1  Med-2  High
1st position   0.5    0.5    0.5    0.5    0.5
2nd position   0.5    0.4    0.3    0.2    0.1
3rd position   0.5    0.6    0.7    0.8    0.9

Considering the fact that both GC and purine contents govern BNC, we fixed one of them to be uniform at the three codon positions and allowed the other to have various positional compositions. We examined heterogeneous positional compositions for GC (Figure 1A to 1C) and purine (Figure 1D to 1F) contents, respectively. Consistent with expectations, when the PCS was uniform, CDC and scaled Nc' performed similarly, both taking a value close to 0 (Figure 1). When the heterogeneity of positional composition increased for GC content (Figure 1A to 1C), CDC continued to perform well for all cases examined, whereas scaled Nc' and scaled Nc generated biased estimates, especially in cases where there was high heterogeneity in positional BNCs. Similarly, when purine content had heterogeneous positional compositions (Figure 1D to 1F), CDC again exhibited much lower biases than scaled Nc' and scaled Nc. Since Nc ignores BNC, Nc' performed better than Nc when the PCS was non-uniform (Figure 1A, 1C, 1D and 1F) and they exhibited comparable estimates only in cases where the PCS was uniform (Figure 1B and 1E). These results agree well with those of Novembre [19]. In addition, when we set heterogeneous positional BNCs for both GC and purine contents, CDC consistently outperformed Nc' and Nc for nearly all the parameter combinations tested (Table 2).

To evaluate CDC in a comprehensive manner, we also examined all possible quantitative relationships among positional GC contents (Table 3), although there are known patterns in the quantitative relationships among positional nucleotide compositions (e.g., GC content at the 1st codon position tends to be larger than that at the 2nd codon position [48]). On the whole, CDC achieved greater power than scaled Nc' and scaled Nc across all examined cases. Scaled Nc' performed better than scaled Nc, consistent again with the analysis reported by Novembre [19]. Similar results were also obtained when we considered all possible quantitative relationships among positional purine contents (Table 4).

Figure 1 Codon usage bias across a variety of positional background nucleotide compositions. Heterogeneous positional background compositions were considered for GC content (panels A to C) and purine content (panels D to F), respectively. The expected values of codon usage bias are zero for all examined cases.

Table 2 Codon usage bias across a variety of positional background compositions for GC and purine contents

GC content   Purine content   CDC       Scaled Nc   Scaled Nc'
None         None             0.00452   0.00001     0.00186
None         Low              0.00407   0.04843     0.05557
None         Med-1            0.00302   0.15130     0.15968
None         Med-2            0.00164   0.28613     0.29389
None         High             0.00054   0.40797     0.41146
Low          None             0.00452   0.05505     0.04181
Low          Low              0.00411   0.09548     0.08752
Low          Med-1            0.00305   0.19808     0.19091
Low          Med-2            0.00164   0.31892     0.31461
Low          High             0.00060   0.44778     0.44199
Med-1        None             0.00486   0.20367     0.17790
Med-1        Low              0.00438   0.23485     0.21262
Med-1        Med-1            0.00305   0.31876     0.29478
Med-1        Med-2            0.00203   0.42851     0.40322
Med-1        High             0.00054   0.53585     0.51978
Med-2        None             0.00529   0.38525     0.36068
Med-2        Low              0.00460   0.40628     0.38358
Med-2        Med-1            0.00337   0.47542     0.43927
Med-2        Med-2            0.00182   0.56759     0.52569
Med-2        High             0.00056   0.65842     0.62645
High         None             0.00606   0.56671     0.54706
High         Low              0.00520   0.59091     0.56666
High         Med-1            0.00371   0.65926     0.61789
High         Med-2            0.00225   0.71856     0.66928
High         High             0.00065   0.77246     0.73600

Sequences with 100000 codons were simulated. The expected value of codon usage bias is zero so that these estimated values are also the deviations from the expected.

To examine the effect of variable sequence length on the integrity of CDC, we considered a wide range of sequence lengths from 100 to 3,000 codons. We set both GC and purine contents to be heterogeneous at the three codon positions using the four non-uniform PCSs (Table 1). To avoid stochastic errors, we repeated simulations 10,000 times for each parameter combination, so that each estimate was determined from 10,000 replicates. Overall, CDC performed better than Nc' and Nc across all sequence lengths examined (Figure 2). When the heterogeneity of BNC increased from low to high, CDC tended to have less bias, whereas Nc' and Nc produced increasingly biased estimates, especially where there was high heterogeneity in positional BNCs (Figure 2D). For short sequences (<300 codons), CDC yielded much lower biases and smaller standard deviations (SD) than Nc' and Nc, although all three measures produced estimates that are somewhat biased. To obtain more reliable estimates of CUB, our results suggest that input sequences should be at least 100 codons in length. When sequence length was decreased below 100 codons, CDC still performed better than Nc' and Nc, although the biases of Nc' and Nc were in opposite directions as compared with those of CDC (Figure 2B to 2D; not apparent in Figure 2A). For long sequences, CDC generated less biased estimates and SDs, whereas Nc' and Nc continued to yield more biased estimates and SDs.
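
The length dependence described above can be explored with a small null simulation. The sketch below is only an illustration under stated assumptions, not the simulation protocol of the paper: function names are hypothetical, the canonical genetic code is assumed, codons are drawn independently from the usage implied by a positional composition set, and the "High" set of Table 1 is applied to both GC and purine contents. Since the expected CUB is zero, the mean CDC across replicates is the bias.

```python
import math
import random
import statistics
from collections import Counter
from itertools import product

# Assumption: canonical genetic code (61 sense codons).
STOP_CODONS = {"TAA", "TAG", "TGA"}
SENSE_CODONS = ["".join(c) for c in product("ATGC", repeat=3)
                if "".join(c) not in STOP_CODONS]


def expected_usage(S, R):
    """Expected sense-codon usage from positional GC (S) and purine (R) contents."""
    nt = [{"A": (1 - s) * r, "T": (1 - s) * (1 - r), "G": s * r, "C": s * (1 - r)}
          for s, r in zip(S, R)]
    raw = [nt[0][c[0]] * nt[1][c[1]] * nt[2][c[2]] for c in SENSE_CODONS]
    total = sum(raw)
    return [v / total for v in raw]


def cdc(codons):
    """CDC of a codon list, using the BNC observed in that same list."""
    n = len(codons)
    S = [sum(c[i] in "GC" for c in codons) / n for i in range(3)]
    R = [sum(c[i] in "AG" for c in codons) / n for i in range(3)]
    expected = expected_usage(S, R)
    counts = Counter(codons)
    observed = [counts[c] / n for c in SENSE_CODONS]
    dot = sum(e * o for e, o in zip(expected, observed))
    norm = math.sqrt(sum(e * e for e in expected) * sum(o * o for o in observed))
    return 1.0 - dot / norm


def simulate_bias(S, R, length, replicates, seed=0):
    """Mean and SD of CDC over null sequences of a given length (in codons)."""
    rng = random.Random(seed)
    weights = expected_usage(S, R)
    values = [cdc(rng.choices(SENSE_CODONS, weights=weights, k=length))
              for _ in range(replicates)]
    return statistics.mean(values), statistics.stdev(values)


if __name__ == "__main__":
    high = (0.5, 0.1, 0.9)  # "High" set of Table 1 (1st, 2nd, 3rd position)
    for length in (100, 300, 1000, 3000):
        mean, sd = simulate_bias(high, high, length, replicates=200)
        print(length, round(mean, 4), round(sd, 4))
```

In such a simulation the mean CDC of null sequences shrinks as the number of codons grows, which is the sampling effect that makes estimates for short sequences less reliable.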

To test the influence of different CUBs on the power of CDC, we evaluated a range of CUBs from low to high. Unlike the previous simulations (which are based on nucleotide compositions), we generated simulated sequences by randomly setting different synonymous codon frequencies and considering variable CUBs ranging from 0.1 to 0.9. We repeated simulations 1,000 times for each case, so that each estimate was averaged over 1,000 replicates. On the whole, CDC exhibited greater power in detecting diverse CUBs; compared with Nc' and Nc, the CUBs estimated by CDC were much closer to the expected ones (Table 5). When the expected CUBs varied from low to high, CDC consistently produced close estimates. By contrast, Nc' and Nc yielded biased CUB estimates across all tested cases, and these biases became more pronounced when the expected CUB was extremely low. When the expected CUBs increased from low to high, Nc' and Nc exhibited increasing power in CUB estimation. While they approached the power of CDC when the expected CUB was high, CDC remained more powerful than Nc' and Nc. Taken together, our simulation results demonstrated that CDC is superior to Nc' and Nc.

Table 3 Codon usage bias across all possible quantitative relationships among positional GC contents

GC content        Purine content = 0.3              Purine content = 0.5              Purine content = 0.7
1st  2nd  3rd     CDC      Scaled Nc  Scaled Nc'    CDC      Scaled Nc  Scaled Nc'    CDC      Scaled Nc  Scaled Nc'
0.3  0.5  0.7     0.00153  0.34160    0.23472       0.00586  0.24586    0.23332       0.00481  0.39716    0.21314
0.3  0.7  0.5     0.00147  0.15648    0.05716       0.00551  0.04827    0.06330       0.00498  0.24616    0.05866
0.5  0.3  0.7     0.00146  0.36662    0.19363       0.00470  0.20034    0.17544       0.00441  0.34555    0.17306
0.5  0.7  0.3     0.00143  0.35276    0.21224       0.00519  0.19619    0.21974       0.00417  0.34831    0.21815
0.7  0.3  0.5     0.00069  0.21330    0.01419       0.00236  0.02999    0.02692       0.00233  0.16172    0.03574
0.7  0.5  0.3     0.00066  0.38224    0.22121       0.00257  0.22392    0.23947       0.00236  0.33561    0.24588

Sequences with 100000 codons were simulated. The compositions in the Med-1 set (0.3, 0.5 and 0.7) were used. GC content was considered non-uniform at three codon positions, whereas purine content was set uniform at three codon positions. The expected value of codon usage bias is zero so that these estimated values are also the deviations from the expected.

Table 4 Codon usage bias across all possible quantitative relationships among positional purine contents

Purine content    GC content = 0.3                  GC content = 0.5                  GC content = 0.7
1st  2nd  3rd     CDC      Scaled Nc  Scaled Nc'    CDC      Scaled Nc  Scaled Nc'    CDC      Scaled Nc  Scaled Nc'
0.3  0.5  0.7     0.01743  0.35780    0.18606       0.01023  0.15974    0.17789       0.00232  0.34949    0.17267
0.3  0.7  0.5     0.01836  0.21922    0.01880       0.01036  0.01515    0.01520       0.00263  0.24157    0.00941
0.5  0.3  0.7     0.00616  0.38200    0.16209       0.00294  0.15248    0.16112       0.00063  0.33321    0.16601
0.5  0.7  0.3     0.00566  0.31973    0.15002       0.00302  0.16556    0.15842       0.00061  0.37234    0.15754
0.7  0.3  0.5     0.00182  0.27781    0.02340       0.00079  0.02564    0.02805       0.00026  0.21360    0.02756
0.7  0.5  0.3     0.00179  0.35410    0.15793       0.00087  0.16099    0.15939       0.00024  0.35439    0.15404

Sequences with 100000 codons were simulated. The compositions in the Med-1 set (0.3, 0.5 and 0.7) were used. Purine content was considered non-uniform at three codon positions, whereas GC content was set uniform at three codon positions. The expected value of codon usage bias is zero so that these estimated values are also the deviations from the expected.

Application to empirical data

It is generally acknowledged that CUB correlates closely with gene expression level in both unicellular [6-10] and multicellular [11,49-51] organisms. Different species may have different heterogeneities in positional BNCs. To empirically test CDC and compare it to three popular measures, Nc', Nc and CAI, we collected multiple expression data sets from five different species in this study: (1) Escherichia coli from Bernstein et al. [52] (in LB and M9 media), (2) Saccharomyces cerevisiae from Holstege et al. [53], (3) Drosophila melanogaster from Zhang et al. [54], (4) Caenorhabditis elegans from Roy et al. [55], and (5) Arabidopsis thaliana from Wuest et al. [56] (Additional file 1). We estimated CUB by CDC, scaled Nc', scaled Nc and CAI, and correlated their estimates with gene expression levels in these five species (Table 6).

Figure 2 Codon usage bias across a range of sequence lengths. Sequences were simulated with the four non-uniform positional composition sets: Low (panel A), Med-1 (panel B), Med-2 (panel C) and High (panel D). Each estimate was determined based on 10000 replicate simulated sequences. The expected values of codon usage bias are zero for all examined cases.

Table 5 Differences between estimated and expected codon usage biases

                (Estimated CUB)^a - (Expected CUB)
Expected CUB    CDC        Scaled Nc   Scaled Nc'
0.1             0.00137    0.60854     0.61438
0.2             0.00174    0.47951     0.52490
0.3             -0.00245   0.38428     0.43524
0.4             0.00186    0.27647     0.35793
0.5             -0.00060   0.17750     0.21300
0.6             0.00437    0.08031     0.15215
0.7             0.00542    0.01312     0.06657
0.8             -0.00014   0.04816     -0.02663
0.9^b

^a Each estimate was averaged over 1000 replicate simulated sequences that each had 100000 codons.
^b Sequences with the expected codon usage bias at 0.9 were not possible to successfully simulate.

On the whole, CDC outperformed scaled Nc' and scaled Nc in correlating closely with gene expression level. Although CDC and scaled Nc' produced comparable correlation coefficients in yeast (detailed below), CDC exhibited larger correlation coefficients than scaled Nc' and scaled Nc in all the remaining cases (Table 6). When comparing CDC to CAI, we found comparable correlation coefficients in E. coli (LB medium) and yeast, but in general CDC performed better than CAI (Table 6 and Additional file 1). However, it should be noted that the values of CAI are calculated from expression data (since it requires a reference set of highly-expressed genes), whereas those of CDC are not. When we restricted the above analysis to the top 10% of genes ranked by expression level, CDC continued to perform better than scaled Nc', scaled Nc, and CAI (Additional file 1). In addition, considering the correlation coefficients among these five species, we found that the smallest values always belonged to A. thaliana (regardless of the metric used), indicating relatively weaker selection on A. thaliana codon usage in comparison with the other four species (Table 6). This phenomenon was discovered previously in a comparative analysis between A. thaliana and Oryza sativa [57]. Overall, CDC correlated positively with gene expression level, much better than scaled Nc', scaled Nc, and CAI.



As noted, the correlation coefficients produced by CDC and scaled Nc' were similar in yeast but different in the other species (Table 6). Since CDC takes positional GC and purine contents as BNC, whereas Nc' considers only GC content as BNC and ignores positional heterogeneity, this result can probably be explained by relatively lower heterogeneity of positional BNCs in yeast. To further investigate this possibility, we examined the heterogeneities of positional GC and purine contents in these five species (Figure 3). Consistent with our expectation, heterogeneities of positional GC contents were indeed lower in yeast than in the other species (Figure 3A to 3C), especially at the second and third codon positions. In contrast, higher heterogeneities of positional GC contents were apparent in E. coli (Figure 3A and 3B for the first and second codon positions, respectively) and D. melanogaster (Figure 3B and 3C for the second and third codon positions, respectively). These results agree well with the observation that the difference in correlation coefficient between CDC and scaled Nc' in yeast was smaller than that in E. coli or D. melanogaster (Table 6). As a consequence, CDC correlated more closely with scaled Nc' in yeast than in E. coli or D. melanogaster (Figure S13 in Additional file 1). In contrast to GC content, heterogeneities of positional purine contents were relatively smaller and similar among the five species tested, presumably attributable to the fact that GC content ranges more broadly (20%-80%) than purine content (40%-60%) [48,58,59].

Table 6 Correlation coefficients of codon usage bias with gene expression level

Data^a         E. coli^1         E. coli^1         S. cerevisiae^2   D. melanogaster^3   C. elegans^4      A. thaliana^5
               LB (n = 1762^b)   M9 (n = 2766^b)   (n = 5142^b)      (n = 1651^b)        (n = 12184^b)     (n = 1332^b)
CDC^c          0.433             0.367             0.654             0.460               0.374             0.228
Scaled Nc'^c   0.315             0.187             0.664             0.302               0.328             0.130
Scaled Nc^c    0.257             0.125             0.600             0.321               0.192             0.063
CAI^c          0.443             0.288             0.675             0.386               -0.118            0.034

^a Expression data were obtained from ^1 Bernstein et al., ^2 Holstege et al., ^3 Zhang et al., ^4 Roy et al., and ^5 Wuest et al. (see details in Additional file 1).
^b Number of genes (n).
^c P < 0.0001 for all values.

Figure 3 Heterogeneity of positional background nucleotide compositions in E. coli (2,766 genes in M9 medium), S. cerevisiae (5,142 genes), D. melanogaster (1,651 genes), C. elegans (12,184 genes), and A. thaliana (1,332 genes). Heterogeneities of positional GC contents are represented by absolute differences between overall GC content and its positional contents: |GC-GC1| for the first position (panel A), |GC-GC2| for the second position (panel B), and |GC-GC3| for the third position (panel C), respectively. Likewise, heterogeneities of positional purine content are absolute differences between overall purine (AG) content and its positional contents: |AG-AG1| for the first position (panel D), |AG-AG2| for the second position (panel E), and |AG-AG3| for the third position (panel F), respectively.

Figure 4 Comparison of CDC distributions between ribosomal protein (RP) genes (54 genes, with CDC values from 0.244 to 0.481) and all genes (4,144 genes, with CDC values from 0.046 to 0.550) in E. coli.

We proceeded to calculate CDC values (as well as GC and purine contents) for all E. coli genes (Additional file 2). CDC values ranged from 0.046 to 0.550 and the mean and median values were 0.239 and 0.187, respectively (Figure 4). The majority of genes (69%) exhibited CDC values between 0.15 and 0.25. The gene with the highest CDC value is trpL, a key component in the attenuation system that controls the expression of the trpLEDCBA operon in response to tryptophan availability [60]. However, bootstrap resampling shows that the CUB value of the trpL gene is not statistically significant (P = 0.77), most likely due to its short length (14 aa), consistent with our simulation results that short sequences tend to have biased CUB estimates. The gene with the highest CDC value that is also statistically significant in CUB is rpmI (CDC = 0.481), which encodes ribosomal protein L35. In contrast, scaled Nc' and scaled Nc identified the rplL (encoding the ribosomal protein L7/L12) and eno (catalyzing the interconversion of 2-phosphoglycerate and phosphoenolpyruvate) genes, respectively, as having the strongest CUBs (Additional file 2).

Ribosomal protein (RP) genes are, in general, both essential and highly expressed, and it is believed that their CUB values are greater than those of other genes [61]. In the case of E. coli, CDC values for 54 RP genes vary from 0.244 to 0.481, larger than the mean and median values of all E. coli genes (Figure 4). Nearly all RP genes have statistically significant CUBs, with three exceptions (Additional file 3): (1) rpmE: CDC = 0.267, P = 0.1136; encoding RP L31, which may be loosely associated with the ribosome [62]; (2) rpmF: CDC = 0.329, P = 0.1096; encoding RP L32, which is located near the peptidyltransferase center [63]; and (3) rpmJ: CDC = 0.422, P = 0.0564; encoding RP L36, which is non-essential for protein synthesis [64]. These results suggest that an accurate measure such as CDC has the potential to illuminate the evolutionary process that has operated on each gene.

Conclusions 

In summary, we have described a novel measure of CUB, the Codon Deviation Coefficient. As validated by simulated sequences and empirical data, CDC outperforms other measures by providing informative estimates of CUB and its statistical significance. CDC requires no prior knowledge regarding gene expression or function, properly accounts for BNC, and utilizes a bootstrap assessment to evaluate the statistical significance of CUB. Therefore, CDC promises a significant advance in raw analysis of codon usage, providing the means to better reveal aspects of the historical evolutionary pressures on gene function without the assumptions of underlying reference data sets.

Additional material 



Abbreviations 

CUB: Codon Usage Bias; CDC: Codon Deviation Coefficient; BNC: Background Nucleotide Composition; PCS: Positional Composition Set; A: Adenine content; T: Thymine content; G: Guanine content; C: Cytosine content; S: GC content; R: Purine content; A_i, T_i, G_i, C_i, S_i, R_i: A, T, G, C, S, R at codon position i, respectively, where i = 1, 2, 3.